
[TRTLLM-12347][feat] enable VSA in VisualGen#14280

Open
o-stoner wants to merge 17 commits into
NVIDIA:mainfrom
o-stoner:user/o-stoner/visual-gen-vsa

Conversation

@o-stoner

@o-stoner o-stoner commented May 19, 2026

Collaborator

Summary by CodeRabbit

  • New Features

    • Added Video Sparse Attention (VSA) algorithm for visual generation models, enabling efficient sparse attention computation on Blackwell GPUs for supported Wan pipelines.
    • Introduced flow_shift parameter to override scheduler configuration in Wan pipelines during inference (this allows an apples-to-apples quality comparison with FastVideo, where the Wan pipelines use different flow_shift values than what exists by default in the scheduler).
    • Added VideoSparseAttentionConfig for controlling VSA sparsity levels.
  • Enhancements

    • Extended attention mechanisms to support configurable gate tensors for fine-grained attention control.
    • Improved sparse attention validation across multiple pipeline variants (FLUX, FLUX.2, LTX-2, Wan).

Description

Adds VSA attention backend for TRT-LLM VisualGen based on the following VSA paper. Integrates the B200 CuteDSL kernel here. Currently, this backend is supported for Wan 2.1 using the following fine-tuned model from FastVideo. This support will be extended to Wan 2.2 T2V 14B / TI2V 5B once ModelOpt fine-tuned weights are ready.

Quality/perf findings are summarized on the page here, and quality comparisons against H200 FastVideo with the same input noisy latent and flow_shift value are summarized here.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@o-stoner

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Collaborator

PR_Github #49010 [ run ] triggered by Bot. Commit: bc6138d Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #49010 [ run ] completed with state SUCCESS. Commit: bc6138d
/LLM/main/L0_MergeRequest_PR pipeline #38748 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@o-stoner

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Collaborator

PR_Github #49483 [ run ] triggered by Bot. Commit: 9cc9858 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #49483 [ run ] completed with state SUCCESS. Commit: 9cc9858
/LLM/main/L0_MergeRequest_PR pipeline #39123 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Comment thread tensorrt_llm/_torch/visual_gen/config.py Outdated
@xrq-phys

Collaborator

Suggested restructuring: reuse the CUTEDSL backend (from PR #13721) and split VSA-specific branches into a sparsity sub-config + kernel sub-directory

Hi @o-stoner! I checked in with @zhenhuaw-me today and we'd like to coordinate the VSA integration so it composes with the CuTe-DSL backend that #13721 is about to land. Below is a concrete restructuring proposal; happy to discuss alternatives if any of these don't fit your kernel's constraints.

Context (what PR #13721 brings):

  • A new AttentionConfig.backend = "CUTEDSL" choice, served by tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py.
  • A directory tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/attention/ holding the packaged cubins (cubins/) and a thin Python runner (fmha.py) that resolves and launches them.
  • A new optional AttentionConfig.quant_attention_config sub-config (Optional[QuantAttentionConfig]) that turns on QK16PV8 (and, on TRTLLM, SAGE). Backend init reads it; absence means "default behavior".

Requested changes for #14280:

  1. Drop the new "VSA" backend literal; reuse "CUTEDSL".
    Given that VSA is a "CuTe DSL attention kernel with sparsity", it can be considered as the same backend. Concretely:

    • In config.py (after CuTe DSL lands), VSA does not need to extend backend's Literal[...] set — it'll already include "CUTEDSL".
    • Drop VSAAttentionBackend as a separate class registered in get_visual_gen_attention_backend. The factory keeps returning CuTeDSLAttention for "CUTEDSL"; that class (implemented in tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py) becomes the dispatcher between quantized attention and sparse attention.
  2. Move the VSA CuTe DSL kernel source to tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/.
    Instead of landing the kernels in cute_dsl_kernels/video_sparse_attention/, a convention like cute_dsl_kernels/blackwell/<feature>/ would fit better (see tensorrt_llm/_torch/cute_dsl_kernels/blackwell/ for the LLM-side directory structure), so:

    • cute_dsl_kernels/blackwell/attention/ — packaged cubins + runner for dense / QK16PV8 (PR 13721)
    • cute_dsl_kernels/blackwell/video_sparse_attention/ — VSA JIT source + interface (this PR)
  3. Add an Optional[SparseAttentionConfig] field on AttentionConfig next to quant_attention_config.
    Mirror the pattern PR 13721 introduced: a sub-config that is None by default; setting it switches the backend into the sparse path. Something like:

    class SparseAttentionConfig(StrictBaseModel):
        """Sparse-attention recipe (CUTEDSL backend / VSA only)."""
        vsa_sparsity: float = Field(0.875, ge=0.0, le=1.0, ...)
        skip_softmax_threshold: float = Field(0.0, ge=0.0)
    
    class AttentionConfig(StrictBaseModel):
        backend: Literal["VANILLA", "TRTLLM", "FA4", "CUTEDSL"] = ...
        quant_attention_config: Optional[QuantAttentionConfig] = None     # from #13721
        sparse_attention_config: Optional[SparseAttentionConfig] = None   # new in this PR
  4. Move the VSA caller-side wrapper into cute_dsl.py; let CuTeDSLAttention dispatch between the two CuTe DSL kernels.
    Sketch (pseudocode, names are negotiable):

    # attention_backend/cute_dsl.py
    class CuTeDSLAttention(AttentionBackend):
        def __init__(self, ..., quant_attention_config=None, sparse_attention_config=None, **kw):
            # Mutually exclusive: at most one of quant_/sparse_attention_config is set
            # (a config-level validator will enforce this).
            self.quant_attention_config = quant_attention_config
            self.sparse_attention_config = sparse_attention_config
            ...
    
        def forward(self, q, k, v, *, gate_compress=None, gate_fine=None, **kw):
            if self.sparse_attention_config is not None:
                # VSA path: tile / coarse-pool / topk / block-sparse JIT kernel
                return self._forward_vsa(q, k, v, gate_compress=gate_compress, gate_fine=gate_fine, **kw)
            # Standard dense / QK16PV8 path: packaged cubins
            return self._forward_dense(q, k, v, **kw)
    • _forward_dense is what cute_dsl.py already does in PR 13721 (calls into cute_dsl_kernels/blackwell/attention/).
    • _forward_vsa holds today's VSAAttentionBackend.forward body and calls into cute_dsl_kernels/blackwell/video_sparse_attention/.
    • The tiling / metadata builder (VSAMetadata, VSAMetadataBuilder, set_vsa_forward_context) can stay in cute_dsl.py alongside the dispatcher, or live in cute_dsl_kernels/blackwell/video_sparse_attention/interface.py and be imported lazily at the runtime call site.

    This way, callers see exactly one "CUTEDSL" backend; the sparse vs dense decision is data-driven (presence of sparse_attention_config) rather than a third backend name.

Let me know what you think — or if any of these conflict with constraints I'm missing (e.g., kernel availability, config, etc.).
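
As an illustration of the "at most one of quant_/sparse_attention_config" rule in item 4, here is a minimal sketch of a construction-time check. It uses plain dataclasses instead of the project's StrictBaseModel so that it runs standalone; the `recipe` field, the `"VANILLA"` default, and the helper `selects_sparse_path` are hypothetical stand-ins, not definitions from either PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QuantAttentionConfig:
    # Hypothetical placeholder field; the real sub-config comes from #13721.
    recipe: str = "QK16PV8"


@dataclass(frozen=True)
class SparseAttentionConfig:
    vsa_sparsity: float = 0.875

    def __post_init__(self) -> None:
        # Mirrors the ge=0.0 / le=1.0 bounds from the proposal above.
        if not 0.0 <= self.vsa_sparsity <= 1.0:
            raise ValueError("vsa_sparsity must be within [0.0, 1.0]")


@dataclass(frozen=True)
class AttentionConfig:
    # Assumed default; the proposal leaves the default elided.
    backend: str = "VANILLA"
    quant_attention_config: Optional[QuantAttentionConfig] = None
    sparse_attention_config: Optional[SparseAttentionConfig] = None

    def __post_init__(self) -> None:
        both_set = (
            self.quant_attention_config is not None
            and self.sparse_attention_config is not None
        )
        if both_set:
            raise ValueError(
                "quant_attention_config and sparse_attention_config are mutually exclusive"
            )
        if self.sparse_attention_config is not None and self.backend != "CUTEDSL":
            raise ValueError("sparse_attention_config requires backend='CUTEDSL'")


def selects_sparse_path(cfg: AttentionConfig) -> bool:
    """Data-driven dispatch: the sparse path is chosen by the sub-config's presence."""
    return cfg.backend == "CUTEDSL" and cfg.sparse_attention_config is not None
```

With a check like this, an invalid combination fails when the config is built rather than inside `forward`.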

@o-stoner o-stoner force-pushed the user/o-stoner/visual-gen-vsa branch from 9cc9858 to 22b2f5d Compare June 1, 2026 22:24
@o-stoner

o-stoner commented Jun 1, 2026

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Collaborator

PR_Github #51445 [ run ] triggered by Bot. Commit: 8e3e4a9 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #51445 [ run ] completed with state SUCCESS. Commit: 8e3e4a9
/LLM/main/L0_MergeRequest_PR pipeline #40853 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@o-stoner

o-stoner commented Jun 2, 2026

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@o-stoner o-stoner marked this pull request as ready for review June 2, 2026 16:50
@o-stoner o-stoner requested review from a team as code owners June 2, 2026 16:50
@o-stoner o-stoner requested a review from hchings June 2, 2026 16:50
@tensorrt-cicd

Collaborator

PR_Github #51646 [ run ] triggered by Bot. Commit: 602e090 Link to invocation

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This pull request adds comprehensive Video Sparse Attention (VSA) support to TensorRT-LLM's visual generation framework for Blackwell GPUs. It includes a new CUTE DSL persistent kernel with custom scheduler and PTX primitives, integration into CuTeDSLAttention and distributed attention backends, Wan pipeline orchestration with per-step metadata building, and extensive test coverage validating correctness, equivalence, performance, and multi-GPU distributed execution.

Changes

VSA Configuration and Type System

Layer / File(s) Summary
VideoSparseAttentionConfig and sparse attention discriminator
tensorrt_llm/visual_gen/sparse_attention.py, tensorrt_llm/visual_gen/args.py, tensorrt_llm/visual_gen/__init__.py
Adds VideoSparseAttentionConfig Pydantic type with vsa_sparsity (0.0-1.0) parameter, updates SparseAttentionConfig union to discriminate between skip-softmax and vsa algorithms, and introduces AttentionConfig validators enforcing backend-algorithm compatibility and mutual exclusivity of quant and sparse configs.
Flow shift scheduler override parameter
tensorrt_llm/visual_gen/params.py
Adds optional flow_shift field to VisualGenParams to override scheduler's flow-matching shift per-pipeline-variant.

CuTe DSL Persistent Kernel Implementation

Layer / File(s) Summary
Static persistent tile scheduler for VSA
tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/scheduler.py
Implements WorkTileInfo, ParamsBase, TileSchedulerParams, and StaticPersistentScheduler for managing 3D tile space (blocks × heads × batches) scheduling and persistent work distribution across SM blocks with divmod-based tile-to-coordinate mapping.
Blackwell PTX-backed math and atomic operations
tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/ptx.py
Defines warp reductions (warp_reduction_fmax), atomic ops (shared/global atomicAdd_f32/atomicMax_f32), and exp2 emulation via polynomial evaluation and PTX inline assembly for Blackwell-specific float math.
VideoSparseAttentionForwardGroup2QInterleaveKV persistent kernel
tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/block_sparse_attn_dsl_fwd.py
Implements multi-stage CUTE DSL kernel with load (TMA Q/K/V streaming), MMA (block-sparse QK GEMM), softmax (masked running max/sum), correction (LSE + rescaling), and epilogue (TMA O writeback) stages coordinated via warpgroup pipelines and barriers.
CuTe compilation cache and interface
tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/interface.py
Exports is_cute_supported() capability gating, block_sparse_attn_from_indices_cute() kernel entry with per-shape JIT compilation cache, and CUTE_AVAILABLE flag for graceful fallback when CuTe/CUDA dependencies are unavailable.
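
The divmod-based tile-to-coordinate mapping described for scheduler.py can be sketched in plain Python. This is only an illustration of the technique: the function names are hypothetical, and the choice of letting the block index vary fastest is an assumption, since the walkthrough does not state the kernel's actual ordering.

```python
from typing import Iterator, Tuple


def tile_to_coord(tile_idx: int, num_blocks: int, num_heads: int) -> Tuple[int, int, int]:
    """Map a linear tile index to (block, head, batch).

    Assumed ordering: block varies fastest, then head, then batch.
    """
    rest, block = divmod(tile_idx, num_blocks)
    batch, head = divmod(rest, num_heads)
    return block, head, batch


def persistent_tiles(
    cta_idx: int, num_ctas: int, num_blocks: int, num_heads: int, num_batches: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield the tiles one persistent CTA would process.

    Each CTA starts at its own index and strides by the number of CTAs,
    so the whole blocks x heads x batches space is covered exactly once.
    """
    total_tiles = num_blocks * num_heads * num_batches
    for tile_idx in range(cta_idx, total_tiles, num_ctas):
        yield tile_to_coord(tile_idx, num_blocks, num_heads)
```

A static schedule like this needs no atomics or work queue, because every CTA can compute its own tile list from its index alone.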

Backend and Module Integration

Layer / File(s) Summary
CuTeDSLAttention VSA sparse execution path
tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py, tensorrt_llm/_torch/visual_gen/attention_backend/__init__.py
Routes forward() through VSA-specific _forward_vsa() when sparse_attention_config is set; manages per-shape VSAMetadata caching, forward-context stack, tiling/partitioning, coarse cube selection via softmaxed pooling, optional CuTe kernel dispatch with dense SDPA fallback, and gated combination of coarse/fine outputs. Relaxes head-dim constraint to allow non-128 dimensions via runtime fallback.
Attention module VSA gate routing and backend selection
tensorrt_llm/_torch/visual_gen/modules/attention.py
Routes gate_compress/gate_fine from caller layout into backend's expected 4D layout with optional HND-layout transpose; implements VSA backend selection for SEPARATE_QKV mode, validates VSA incompatibility with Attention2D, and forwards gate kwargs to backend.
Ulysses distributed gate tensor handling
tensorrt_llm/_torch/visual_gen/attention_backend/parallel.py
Transforms gate_compress/gate_fine via all_to_all_4d alongside Q/K/V to maintain correct post-A2A sharding layout and applies same transposition rules as inner backend expects.
Backend factory and config wiring
tensorrt_llm/_torch/visual_gen/attention_backend/utils.py
Conditionally passes attention_config.sparse_attention_config into CUTEDSL backend kwargs to enable VSA configuration dispatch.
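
To make the "coarse cube selection via softmaxed pooling" step more concrete, here is a minimal pure-Python sketch. It is not the PR's implementation: it flattens cubes to consecutive 1D groups of tokens, and the rule `keep = ceil((1 - sparsity) * num_cubes)` is an assumption chosen so that sparsity 0.0 keeps every cube (the dense case mentioned in the tests). All function names are hypothetical.

```python
import math
from typing import List, Sequence


def mean_pool(tokens: Sequence[Sequence[float]], cube_size: int) -> List[List[float]]:
    """Average consecutive groups of `cube_size` token vectors into one vector per cube."""
    cubes = []
    for start in range(0, len(tokens), cube_size):
        group = tokens[start:start + cube_size]
        dim = len(group[0])
        cubes.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return cubes


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_cubes(
    q_cubes: Sequence[Sequence[float]],
    k_cubes: Sequence[Sequence[float]],
    sparsity: float,
) -> List[List[int]]:
    """For each query cube, return the sorted indices of the key cubes to keep."""
    num_cubes = len(k_cubes)
    # Assumed rule: keep the top (1 - sparsity) fraction, at least one cube.
    keep = max(1, math.ceil((1.0 - sparsity) * num_cubes))
    scale = 1.0 / math.sqrt(len(q_cubes[0]))
    selected = []
    for q in q_cubes:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_cubes]
        probs = softmax(scores)
        ranked = sorted(range(num_cubes), key=lambda i: probs[i], reverse=True)
        selected.append(sorted(ranked[:keep]))
    return selected
```

The selected indices are what a block-sparse fine stage would consume; the coarse output itself would then be combined with the fine output through the gates.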

Wan Pipeline VSA Orchestration

Layer / File(s) Summary
Wan pipeline VSA metadata building and forward context
tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py
Computes VSAMetadataBuilder once per forward() call, builds per-step metadata during denoising loop using current timestep, latent shape, and pacing parameters, and wraps transformer forward in set_vsa_forward_context() to make metadata available to attention layers.
Flow shift scheduler override mechanism
tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py, tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.py
Accepts optional flow_shift parameter in pipeline forward() and infer(), detects applicable shift key in scheduler config (shift or flow_shift), logs override, and updates via register_to_config() before set_timesteps().
Wan transformer block VSA gate projections
tensorrt_llm/_torch/visual_gen/models/wan/transformer_wan.py
Conditionally creates to_gate_compress and to_gate_fine Linear projections in WanBlock when VSA is active; computes gate tensors from normalized hidden state during forward and forwards to self-attention via kwargs.
Pipeline variant VSA support validation
tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux.py, tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux2.py, tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py, tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.py
Validates sparse_attention_config.algorithm and rejects VSA with informative errors restricting it to Wan 2.1 T2V 14B (720P) for all other pipeline variants.
Pipeline loader VSA logging
tensorrt_llm/_torch/visual_gen/pipeline_loader.py
Logs detailed VSA backend info including CUTE kernel availability and sparsity when CUTEDSL with VSA is enabled.

VSA Test Coverage

Layer / File(s) Summary
CuTe kernel and VSA correctness tests
tests/unittest/_torch/visual_gen/test_attention_cute_dsl_vsa.py
Validates VSA configuration (cross-attention VANILLA fallback, Attention2D incompatibility, sparsity collapse to dense at 0.0), tile/untile round-trip with padding verification, and CuTe kernel matching against dense scaled_dot_product_attention and masked fp32 reference implementations.
VSA integration and equivalence tests
tests/unittest/_torch/visual_gen/test_attention_integration.py
Validates integrated VSA self-attention equivalence to naive dense SDPA at sparsity=0.0 and verifies output finiteness (no NaN/Inf) across multiple sparsity values.
VSA performance benchmarks
tests/unittest/_torch/visual_gen/test_attention_perf.py
Benchmarks VSA module-level vs VANILLA backend on Wan 2.2 T2V 14B production shapes across multiple sparsity values and compares VSA fine-stage kernel directly against FlashAttention 4 performance.
Multi-GPU Ulysses + VSA distributed validation
tests/unittest/_torch/visual_gen/multi_gpu/test_wan_vsa_ulysses.py
Distributed test harness validating Ulysses + VSA forward pass shape/finiteness and correctness via comparison against single-GPU reference with cosine similarity and tolerance-based assertions.
Test configuration and registry
tests/integration/test_lists/test-db/l0_b200.yml
Registers VSA test into l0_b200 CI configuration.
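
The "cosine similarity and tolerance-based assertions" used by the distributed test can be illustrated with a small standalone check. This is a sketch on flat Python lists; the thresholds `min_cos` and `atol` are hypothetical values, not the ones used in test_wan_vsa_ulysses.py.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def matches_reference(
    out: Sequence[float],
    ref: Sequence[float],
    min_cos: float = 0.99,  # hypothetical threshold
    atol: float = 1e-2,     # hypothetical threshold
) -> bool:
    """Check shape, finiteness, direction, and element-wise closeness."""
    if len(out) != len(ref):
        return False
    if not all(math.isfinite(x) for x in out):
        return False
    max_abs_err = max(abs(x - y) for x, y in zip(out, ref))
    return cosine_similarity(out, ref) >= min_cos and max_abs_err <= atol
```

Checking both metrics is useful because cosine similarity alone ignores a uniform scale error, while an absolute tolerance alone can pass outputs that are small but point in the wrong direction.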

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

VisualGen

Suggested reviewers

  • Shixiaowei02
  • kaiyux
  • chang-l
  • Funatiq
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is substantially incomplete. While it mentions the feature being added and references external resources, it lacks specific details about the implementation, architectural changes, and does not provide any test coverage information despite the template section being present. | Add a detailed 'Description' section explaining what VSA is, how it integrates with VisualGen, key architectural decisions, and the scope of backend support. Complete the 'Test Coverage' section by listing specific test files and test cases (e.g., test_attention_cute_dsl_vsa.py, test_wan_vsa_ulysses.py) that validate the VSA implementation. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title clearly summarizes the main change: enabling VSA (Video Sparse Attention) in VisualGen, which is the core objective of this PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 10

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py`:
- Around line 201-216: The module-global _vsa_forward_context must be replaced
with a request-local contextvar to avoid cross-request clobbering: create a
contextvars.ContextVar[Optional[VSAMetadata]] (e.g. _vsa_forward_context_var)
and update set_vsa_forward_context to set the ContextVar and yield while
storing/resetting the returned token on exit, and update get_vsa_forward_context
to return _vsa_forward_context_var.get(None); keep the same function/class names
(set_vsa_forward_context, get_vsa_forward_context, VSAMetadata,
_vsa_forward_context -> _vsa_forward_context_var) so callers don’t change.
- Around line 527-541: The CuTe branch currently asserts when num_cubes exceeds
VSA_KERNEL_MAX_CUBES; instead modify the gating so the code falls back to dense
SDPA: include the condition num_cubes <= VSA_KERNEL_MAX_CUBES in the computation
of use_cute (the boolean used to choose the CuTe kernel), and remove or replace
the subsequent assert in the CuTe branch (the block referencing
VSA_KERNEL_MAX_CUBES and num_cubes) so oversized inputs simply skip CuTe and use
the existing dense fallback.
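
The first finding above (replacing the module-global with a context variable) could look roughly like the following. This is a minimal sketch under stated assumptions: the metadata type is left as `Any` instead of the real VSAMetadata class, and only the function and variable names mentioned in the finding are reused.

```python
import contextvars
from contextlib import contextmanager
from typing import Any, Iterator, Optional

# Context-local storage: each thread or asyncio task sees its own value.
_vsa_forward_context_var = contextvars.ContextVar("vsa_forward_context", default=None)


@contextmanager
def set_vsa_forward_context(metadata: Any) -> Iterator[None]:
    """Make `metadata` visible to attention layers for the duration of the block."""
    token = _vsa_forward_context_var.set(metadata)
    try:
        yield
    finally:
        # Restore whatever value was active before entering, even on error.
        _vsa_forward_context_var.reset(token)


def get_vsa_forward_context() -> Optional[Any]:
    """Return the metadata set by the innermost active context, or None."""
    return _vsa_forward_context_var.get(None)
```

Resetting with the token, rather than setting the variable back to None, keeps nested contexts correct: leaving an inner block restores the outer block's metadata.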

In
`@tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/block_sparse_attn_dsl_fwd.py`:
- Line 416: The hardcoded limit self.max_indices = 4 * 1024 can be exceeded by
variable_block_sizes, causing a shared-memory overflow when copying into
sVariable_block_sizes; add a runtime validation that
variable_block_sizes.shape[0] <= self.max_indices before the copy (or
assert/raise a clear error) and fail fast with a descriptive message, and/or
enforce the check earlier in block_sparse_attn_from_indices_cute in interface.py
so callers cannot pass larger arrays; update any related docs/comments to state
the max_indices constraint and reference the symbols max_indices,
sVariable_block_sizes, variable_block_sizes, and
block_sparse_attn_from_indices_cute when making the change.
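
A fail-fast check of the kind this finding asks for might be sketched as follows. The helper name `check_variable_block_sizes` is hypothetical, and the input is modeled as a plain sequence rather than a tensor so the example runs standalone; only the `4 * 1024` limit comes from the finding.

```python
from typing import Sequence

# Mirrors the hardcoded self.max_indices = 4 * 1024 quoted in the finding above.
MAX_INDICES = 4 * 1024


def check_variable_block_sizes(
    variable_block_sizes: Sequence[int], max_indices: int = MAX_INDICES
) -> None:
    """Raise before launch if the array would not fit the shared-memory buffer."""
    count = len(variable_block_sizes)
    if count > max_indices:
        raise ValueError(
            f"variable_block_sizes has {count} entries, but the kernel's "
            f"shared-memory buffer holds at most {max_indices}"
        )
```

Running the check in the Python entry point, before the kernel is compiled or launched, turns a silent shared-memory overflow into a descriptive error.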

In
`@tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/ptx.py`:
- Around line 144-160: The inline assembly in the else branch currently emits a
shared-scope atomic ("atom.relaxed.shared::cta.cta.max.s32") but this path
targets global memory; update the asm string in the llvm.inline_asm call to use
the global scope ("atom.relaxed.global::cta.cta.max.s32") while keeping the same
operand ($0) and constraints, i.e., modify the asm literal passed to
llvm.inline_asm (the triple-quoted string) to replace "shared" with "global" so
the global-memory atomic is emitted.

In `@tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.py`:
- Around line 542-555: The code mutates the shared scheduler config when
applying a user flow_shift override (variables: flow_shift, sched_cfg,
shift_key, self.scheduler.register_to_config), which makes the change persist
across requests; instead, apply the override only request-scoped by either
restoring the original sched_cfg[shift_key] after the request or by creating a
request-local copy of the scheduler/config before calling set_timesteps();
specifically, capture the original value (orig_shift =
sched_cfg.get(shift_key)), call register_to_config only on a
cloned/configured-local scheduler or restore orig_shift via
register_to_config(**{shift_key: orig_shift}) after completing the request so
the shared scheduler config is not permanently mutated.

In `@tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py`:
- Around line 489-503: The current code calls
self.scheduler.register_to_config(...) to apply a per-request flow_shift, which
mutates the shared scheduler and leaks that override to subsequent requests;
instead avoid mutating the shared scheduler by creating a request-local
scheduler/config or restoring the original value after use: either clone the
scheduler or its config (e.g., copy sched_cfg = dict(self.scheduler.config) and
apply the flow_shift to that local config or instantiate a shallow copy of the
scheduler) and use that local scheduler/config before calling set_timesteps(),
or if you must modify self.scheduler temporarily, capture the original
sched_cfg[shift_key] first and restore it immediately after the request
completes; reference flow_shift, sched_cfg, self.scheduler, register_to_config,
and set_timesteps when making the change.
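
The "capture and restore" variant suggested in the two pipeline findings above can be sketched with a context manager. `FakeScheduler` is a hypothetical stand-in that exposes only a `config` mapping and `register_to_config`, and `scoped_flow_shift` is an illustrative name; the real scheduler's config handling may differ.

```python
from contextlib import contextmanager
from typing import Iterator, Optional


class FakeScheduler:
    """Hypothetical stand-in exposing only `config` and `register_to_config`."""

    def __init__(self, **config: float) -> None:
        self.config = dict(config)

    def register_to_config(self, **kwargs: float) -> None:
        self.config.update(kwargs)


@contextmanager
def scoped_flow_shift(scheduler: FakeScheduler, flow_shift: Optional[float]) -> Iterator[None]:
    """Apply a flow_shift override for one request and restore the original afterwards."""
    # The walkthrough says the config key may be named `shift` or `flow_shift`.
    shift_key = next((k for k in ("shift", "flow_shift") if k in scheduler.config), None)
    if flow_shift is None or shift_key is None:
        yield  # nothing to override
        return
    orig_shift = scheduler.config[shift_key]
    scheduler.register_to_config(**{shift_key: flow_shift})
    try:
        yield
    finally:
        scheduler.register_to_config(**{shift_key: orig_shift})
```

Restoring the value prevents the override from leaking into later requests, but it does not protect requests that run concurrently on the same scheduler; for that case, the request-local copy mentioned in the findings would be the safer option.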

In `@tensorrt_llm/_torch/visual_gen/models/wan/transformer_wan.py`:
- Around line 369-390: The new VSA gate projections to_gate_compress and
to_gate_fine are created as full-width dense linears on every rank, which
misaligns with attn1's TP-local Q shards and duplicates work when tp_size>1;
change their construction to the same column-parallel/sharded setup used for the
Q projection (i.e., mirror the Q Linear creation parameters: use the same
mapping/partitioning, quant_config, skip_create_weights_in_init,
force_dynamic_quantization, and out-dim q_dim) so each rank only holds its
TP-local slice and the gate tensors line up with attn1's local Q shard. Ensure
you reference and reuse the same sharding/mapping pattern used when creating the
Q projection to_gate (or whichever variable constructs Q) so topology and sizes
match across ranks.

In `@tensorrt_llm/_torch/visual_gen/modules/attention.py`:
- Around line 471-475: The _reshape_gate helper reshapes gate tensors using the
global self.num_attention_heads which desyncs under tensor-parallelism; update
_reshape_gate (used for gate_compress / gate_fine) to compute the head count
from the incoming gate tensor (or use the same local head count used when
reshaping q/k/v) instead of self.num_attention_heads, then apply view/transpose
logic with that derived local_head_count so the final layout matches the
attention tensors and respects backend_layout (AttentionTensorLayout.HND)
handling.

In `@tests/unittest/_torch/visual_gen/test_attention_integration.py`:
- Around line 620-628: After constructing integrated (Attention(...,
config=cfg_vsa)), add an explicit assertion that the VSA path was chosen by
invoking the internal selector or flag (call integrated._build_vsa_setup() or
inspect any backend attribute set by that method) and assert it indicates
CUTEDSL/VSA; e.g., ensure the result/attribute equals the expected VSA backend
before proceeding to use integrated in the test so the test fails if CUTEDSL
silently falls back to dense.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: da759100-77d8-4987-90f4-527b2030545e

📥 Commits

Reviewing files that changed from the base of the PR and between 059de9c and 602e090.

📒 Files selected for processing (26)
  • tensorrt_llm/_torch/visual_gen/attention_backend/__init__.py
  • tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py
  • tensorrt_llm/_torch/visual_gen/attention_backend/parallel.py
  • tensorrt_llm/_torch/visual_gen/attention_backend/utils.py
  • tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/__init__.py
  • tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/block_sparse_attn_dsl_fwd.py
  • tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/interface.py
  • tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/ptx.py
  • tensorrt_llm/_torch/visual_gen/cute_dsl_kernels/blackwell/video_sparse_attention/scheduler.py
  • tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux.py
  • tensorrt_llm/_torch/visual_gen/models/flux/pipeline_flux2.py
  • tensorrt_llm/_torch/visual_gen/models/ltx2/pipeline_ltx2.py
  • tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py
  • tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.py
  • tensorrt_llm/_torch/visual_gen/models/wan/transformer_wan.py
  • tensorrt_llm/_torch/visual_gen/modules/attention.py
  • tensorrt_llm/_torch/visual_gen/pipeline_loader.py
  • tensorrt_llm/visual_gen/__init__.py
  • tensorrt_llm/visual_gen/args.py
  • tensorrt_llm/visual_gen/params.py
  • tensorrt_llm/visual_gen/sparse_attention.py
  • tests/integration/test_lists/test-db/l0_b200.yml
  • tests/unittest/_torch/visual_gen/multi_gpu/test_wan_vsa_ulysses.py
  • tests/unittest/_torch/visual_gen/test_attention_cute_dsl_vsa.py
  • tests/unittest/_torch/visual_gen/test_attention_integration.py
  • tests/unittest/_torch/visual_gen/test_attention_perf.py

Comment thread tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py Outdated
Comment thread tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py Outdated
Comment thread tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan_i2v.py Outdated
Comment thread tensorrt_llm/_torch/visual_gen/models/wan/pipeline_wan.py Outdated
Comment thread tensorrt_llm/_torch/visual_gen/models/wan/transformer_wan.py
Comment thread tensorrt_llm/_torch/visual_gen/modules/attention.py
Comment thread tests/unittest/_torch/visual_gen/test_attention_integration.py
Comment thread tests/unittest/_torch/visual_gen/test_attention_perf.py
@tensorrt-cicd

Collaborator

PR_Github #51646 [ run ] completed with state SUCCESS. Commit: 602e090
/LLM/main/L0_MergeRequest_PR pipeline #41029 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@o-stoner o-stoner requested a review from a team as a code owner June 3, 2026 18:30
@o-stoner o-stoner requested a review from yuxianq June 3, 2026 18:30
@o-stoner o-stoner force-pushed the user/o-stoner/visual-gen-vsa branch from 6fe39d4 to 6079294 Compare June 3, 2026 18:45
@o-stoner

o-stoner commented Jun 3, 2026

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Collaborator

PR_Github #51898 [ run ] triggered by Bot. Commit: 6079294 Link to invocation

Comment thread tensorrt_llm/_torch/visual_gen/attention_backend/cute_dsl.py Outdated
@tensorrt-cicd

Collaborator

PR_Github #51898 [ run ] completed with state FAILURE. Commit: 6079294
/LLM/main/L0_MergeRequest_PR pipeline #41254 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@o-stoner

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Collaborator

PR_Github #54358 [ run ] triggered by Bot. Commit: 29a6226 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54358 [ run ] completed with state FAILURE. Commit: 29a6226
/LLM/main/L0_MergeRequest_PR pipeline #43428 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Olivia Stoner <245287810+o-stoner@users.noreply.github.com>
@o-stoner

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Collaborator

PR_Github #54665 [ run ] triggered by Bot. Commit: be1916c Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54665 [ run ] completed with state FAILURE. Commit: be1916c
/LLM/main/L0_MergeRequest_PR pipeline #43697 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@o-stoner

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Collaborator

PR_Github #55050 [ run ] triggered by Bot. Commit: f7fcafe Link to invocation

@o-stoner

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Collaborator

PR_Github #55060 [ run ] triggered by Bot. Commit: 3730e37 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55050 [ run ] completed with state ABORTED. Commit: f7fcafe

Link to invocation

@chang-l chang-l left a comment

Collaborator

CI coverage for test_wan_vsa_ulysses.py (8-GPU, cfg=2 × ulysses=4) — please confirm it actually runs before merge.

The test is collected via the unittest/_torch/visual_gen/multi_gpu directory entry in l0_dgx_b200.yml, which lives under the system_gpu_count: 8 / stage: post_merge / backend: pytorch condition. Two problems:

--add-multi-gpu-test only adds pre-merge multi-GPU stages, so it will not trigger this. Post-merge tests need /bot run --stage-list "" (or the heavy /bot run --post-merge).

Could you run python scripts/test_to_stage_mapping.py --tests "test_wan_vsa_ulysses" on this branch and confirm which stage runs it, then trigger that stage (e.g. /bot run --stage-list "") and verify it passes before merge?
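The collection rule described above (a directory entry in a test list pulls in every test file beneath it, under that block's stage condition) can be sketched as follows. This is only an illustration: `TEST_DB`, `conditions_collecting`, and the second block's file name are hypothetical, and the flat dictionary layout is an assumed simplification, not the actual schema of `l0_dgx_b200.yml`. Only the three condition keys and the directory entry come from the comment above.

```python
from pathlib import PurePosixPath

# Hypothetical, simplified stand-in for a test-list file such as l0_dgx_b200.yml.
# The real schema may differ; this layout is an assumption for illustration.
TEST_DB = [
    {
        "condition": {"system_gpu_count": 8, "stage": "post_merge", "backend": "pytorch"},
        "tests": ["unittest/_torch/visual_gen/multi_gpu"],
    },
    {
        # Hypothetical second block, included only to show a non-matching entry.
        "condition": {"system_gpu_count": 1, "stage": "pre_merge", "backend": "pytorch"},
        "tests": ["unittest/_torch/visual_gen/test_example.py"],
    },
]


def conditions_collecting(test_path, test_db):
    """Return the condition of every block whose entries cover test_path."""
    target = PurePosixPath(test_path)
    matches = []
    for block in test_db:
        for entry in block["tests"]:
            entry_path = PurePosixPath(entry)
            # An entry covers the test if it names the file itself
            # or one of its parent directories.
            if entry_path == target or entry_path in target.parents:
                matches.append(block["condition"])
                break
    return matches


hits = conditions_collecting(
    "unittest/_torch/visual_gen/multi_gpu/test_wan_vsa_ulysses.py", TEST_DB
)
for cond in hits:
    print(cond["stage"], cond["system_gpu_count"], cond["backend"])
```

Under these assumptions the only matching block is the `post_merge` one, which is why the stage that actually runs the test matters when choosing the `/bot run` options.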

@tensorrt-cicd

Collaborator

PR_Github #55060 [ run ] completed with state FAILURE. Commit: 3730e37
/LLM/main/L0_MergeRequest_PR pipeline #44049 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Olivia Stoner <245287810+o-stoner@users.noreply.github.com>
@o-stoner o-stoner force-pushed the user/o-stoner/visual-gen-vsa branch from 3730e37 to 56a9400 Compare June 23, 2026 17:57
Signed-off-by: o-stoner <245287810+o-stoner@users.noreply.github.com>
@o-stoner

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Collaborator

PR_Github #55315 [ run ] triggered by Bot. Commit: 51ed5c5 Link to invocation

@o-stoner

Collaborator Author

> CI coverage for test_wan_vsa_ulysses.py (8-GPU, cfg=2 × ulysses=4) — please confirm it actually runs before merge.
>
> The test is collected via the unittest/_torch/visual_gen/multi_gpu directory entry in l0_dgx_b200.yml, which lives under the system_gpu_count: 8 / stage: post_merge / backend: pytorch condition. Two problems:
>
> --add-multi-gpu-test only adds pre-merge multi-GPU stages, so it will not trigger this. Post-merge tests need /bot run --stage-list "" (or the heavy /bot run --post-merge).
>
> Could you run python scripts/test_to_stage_mapping.py --tests "test_wan_vsa_ulysses" on this branch and confirm which stage runs it, then trigger that stage (e.g. /bot run --stage-list "") and verify it passes before merge?

@chang-l unittest/_torch/visual_gen/multi_gpu/test_wan_vsa_ulysses.py is run in the CI report here under L0_Test-x86_64-Multi-GPU by DGX_B200-8_GPUs-PyTorch-1, so IIUC it is already being collected and triggered, but please correct me if I am misunderstanding. It failed in the previous run because the process was killed; I will confirm on the next run whether it passes.

@tensorrt-cicd

Collaborator

PR_Github #55315 [ run ] completed with state FAILURE. Commit: 51ed5c5
/LLM/main/L0_MergeRequest_PR pipeline #44267 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@o-stoner

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Collaborator

PR_Github #55536 [ run ] triggered by Bot. Commit: 51ed5c5 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55536 [ run ] completed with state SUCCESS. Commit: 51ed5c5
/LLM/main/L0_MergeRequest_PR pipeline #44463 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Olivia Stoner <245287810+o-stoner@users.noreply.github.com>
@o-stoner

Collaborator Author

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd

Collaborator

PR_Github #55598 [ run ] triggered by Bot. Commit: fa1764e Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #55598 [ run ] completed with state FAILURE. Commit: fa1764e
/LLM/main/L0_MergeRequest_PR pipeline #44515 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

6 participants